Add AG/RS overlap distributed init support #2487
jeffnvidia wants to merge 2 commits into NVIDIA-NeMo:main from
Conversation
📝 Walkthrough

This change adds a new configuration field, `create_all_gather_group`, to distributed initialization and wires the resulting all-gather (AG) process groups into `ProcessGroupCollection`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Config as DistributedInitConfig
    participant InitDist as _initialize_distributed
    participant CreatePG as _create_pg_collection
    participant ParallelState as parallel_state
    participant PGCollection as ProcessGroupCollection

    Config->>InitDist: create_all_gather_group flag
    InitDist->>CreatePG: create_all_gather_group=True
    alt Decentralized Path
        CreatePG->>ParallelState: create_all_gather_groups()
        ParallelState->>PGCollection: populate AG groups
    else Centralized Path
        CreatePG->>CreatePG: build dp_cp_ag_pg
        CreatePG->>CreatePG: build expt_dp_ag_pg (if expert parallel)
        CreatePG->>PGCollection: attach dp_cp_ag & expt_dp_ag
    end
    PGCollection-->>InitDist: return with AG groups
```
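The net effect of the flow above can be sketched as a toy Python function. This is a stand-in for illustration only, not the real Megatron-Bridge `_create_pg_collection`; strings replace the actual NCCL process-group handles.

```python
def create_pg_collection(create_all_gather_group: bool, expert_parallel: bool) -> dict:
    """Toy stand-in: shows which AG groups the flag adds to the collection."""
    pg_collection = {"dp_cp": "dp_cp_pg"}  # base group, always present
    if create_all_gather_group:
        # Independent communicator over the same ranks as dp_cp
        pg_collection["dp_cp_ag"] = "dp_cp_ag_pg"
        if expert_parallel:
            # Expert-DP all-gather group, only when EP is enabled
            pg_collection["expt_dp_ag"] = "expt_dp_ag_pg"
    return pg_collection
```

With the flag off (the default), the collection is unchanged, which matches the "existing behavior preserved" claim in the changelog.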
/ok to test 7f4e63f
Signed-off-by: jeffnvidia <jmahou@nvidia.com>
Signed-off-by: jeffnvidia <jmahou@nvidia.com>
Made-with: Cursor
```python
# Create AG groups if requested
if dist_config.create_all_gather_group:
    for_expert_parallelism = (getattr(model_config, "expert_model_parallel_size", 1) or 1) > 1
    dp_cp_ag, expt_dp_ag = parallel_state.create_all_gather_groups(
        for_expert_parallelism=for_expert_parallelism,
        timeout=datetime.timedelta(minutes=dist_config.distributed_timeout_minutes),
        nccl_comm_cfgs=None,  # Could use dist_config.nccl_communicator_config_path if needed
    )
    # Get ProcessGroupCollection and populate with AG groups
    pg_collection = ProcessGroupCollection.use_mpu_process_groups()
    pg_collection.dp_cp_ag = dp_cp_ag
    if expt_dp_ag is not None:
        pg_collection.expt_dp_ag = expt_dp_ag
    return pg_collection
```
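The `dist_config` fields consumed by the snippet above can be sketched as a small dataclass. This is an illustrative subset, not the real `DistributedInitConfig` from `src/megatron/bridge/training/config.py`, and the defaults other than `create_all_gather_group=False` are assumptions.

```python
from dataclasses import dataclass

@dataclass
class DistributedInitConfig:
    # Illustrative subset; the real class has many more fields,
    # and the timeout default here is an assumption.
    distributed_timeout_minutes: int = 10
    create_all_gather_group: bool = False  # new field added by this PR

def for_expert_parallelism(expert_model_parallel_size) -> bool:
    # The EP > 1 gate from the snippet above; None and 0 count as "EP off"
    return (expert_model_parallel_size or 1) > 1
```

The `or 1` guard mirrors the diff's `(getattr(model_config, "expert_model_parallel_size", 1) or 1) > 1`, so models that leave the attribute unset or `None` take the non-expert path.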
Note to self: attaching this to the PG collection means it gets passed to the Megatron-FSDP DDP FullyShardedDataParallel adapter, which then passes it to the FSDPDistIndex / MegatronFSDP class API.
```python
if create_all_gather_group:
    # Create regular DP all-gather group with same ranks as dp_cp_pg
    # Use HyperCommGrid to enumerate ranks for dp-cp groups
    dp_cp_rank_lists = grid._gen_rank_enum(["dp", "cp"])
```
Why is `grid.create_pg(...)` not working? Ideally we shouldn't use an internal API here; it's a bit risky.
From what I can tell, grid.create_pg(["dp", "cp"]) can't be used here because it was already called on line 415 to create dp_cp_pg.
If I understand correctly, calling it again would raise a KeyError — create_pg keys by dimension names only (line 151 of hyper_comm_grid.py), so any second call with ["dp", "cp"] would collide with the existing "dp-cp" entry, regardless of group_desc or pg_options.
The AG group needs the same ranks but as an independent NCCL communicator, so I used _gen_rank_enum to get the rank lists and passed them to new_subgroups_by_enumeration directly. Same situation for the expert grid on line 516.
That said, I'm not super familiar with the HyperCommGrid internals — happy to refactor if there's a preferred way to create a second PG with the same rank topology?
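To make the rank-enumeration discussion concrete, here is a standalone sketch of what an enumeration like `_gen_rank_enum(["dp", "cp"])` conceptually produces: each group collects the ranks that share coordinates on all other grid dimensions. This is a hypothetical reimplementation for illustration, not HyperCommGrid's actual code; the resulting rank lists are what you would hand to `torch.distributed.new_subgroups_by_enumeration` to get an independent communicator over the same ranks.

```python
from itertools import product

def gen_rank_enum(dim_sizes: dict, group_dims: list) -> list:
    """Enumerate rank lists for groups spanning group_dims.

    Ranks are laid out row-major over the dims in insertion order,
    with the last dim varying fastest (a modeling assumption).
    """
    dims = list(dim_sizes)
    strides, stride = {}, 1
    for d in reversed(dims):
        strides[d] = stride
        stride *= dim_sizes[d]
    fixed = [d for d in dims if d not in group_dims]
    groups = []
    for fc in product(*(range(dim_sizes[d]) for d in fixed)):
        # All ranks sharing these fixed coordinates form one group
        base = sum(c * strides[d] for d, c in zip(fixed, fc))
        ranks = sorted(
            base + sum(c * strides[d] for d, c in zip(group_dims, gc))
            for gc in product(*(range(dim_sizes[d]) for d in group_dims))
        )
        groups.append(ranks)
    return groups
```

For example, on a dp=2, cp=2, tp=2 grid, grouping over `["dp", "cp"]` yields one group per tp coordinate, each spanning four ranks; creating a second process group from the same lists gives an independent communicator with identical topology.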
What does this PR do?

Enable AG/RS overlap process-group plumbing in Megatron-Bridge distributed initialization by adding support for `create_all_gather_group`, including expert-parallel all-gather groups when EP > 1, and wiring these groups into `ProcessGroupCollection`.

Changelog

- Added `create_all_gather_group: bool = False` to `DistributedInitConfig` in `src/megatron/bridge/training/config.py`.
- Updated `src/megatron/bridge/training/initialize.py` to thread `create_all_gather_group` through distributed init.
- In the `HyperCommGrid` path, create and attach:
  - `dp_cp_ag` (DP+CP all-gather group)
  - `expt_dp_ag` (expert-DP all-gather group when EP is enabled)
- Call `parallel_state.create_all_gather_groups(...)` after model-parallel init and attach the returned AG groups to `ProcessGroupCollection`.
- Default: `create_all_gather_group=False` (no AG group creation; existing behavior preserved).
- (The `pyproject.toml` local testing override is excluded.)

GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information